Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new index to postgres datastore for GC #1550

Merged
merged 1 commit into from
Oct 3, 2023
Merged

Conversation

vroldanbet
Copy link
Contributor

@vroldanbet vroldanbet commented Sep 28, 2023

This adds 1 new indexes to optimize:

  • GC operations

⚠️ important! This index will not be selected by the query planner in PG 13 and 14 due to the usage of the xid8 type

add tuned GC index

the original GC index was rarely selected
when GC'ing relations with millions of relationships.
This caused a sequential scan that could load potentially
Gigabytes worth of data, and overload the database.

⚠️ Postgres 13 and 14 do not correctly select this new index
because the query planner struggles with the xid8 datatype.
Postgres 15 will be needed for this index to
actually work. For scenarios where relations aren't very wide,
older PG versions may have an acceptable query latency for GC,
but we strongly recommend moving to PG15.

This migration also deletes the old suboptimal index after
the new one is created.

@vroldanbet vroldanbet requested a review from a team September 28, 2023 09:03
@github-actions github-actions bot added the area/datastore Affects the storage system label Sep 28, 2023
@vroldanbet vroldanbet self-assigned this Sep 28, 2023
@jzelinskie jzelinskie added the area/perf Affects performance or scalability label Sep 28, 2023
@josephschorr
Copy link
Member

@vroldanbet I wonder if we should have a migration to run ANALYZE TABLE after the concurrent indexes are added to ensure the stats are correct for them?

@vroldanbet
Copy link
Contributor Author

@josephschorr from this post:

A table of 40 Million rows, the auto vacuum will run when the table will receive 8 Million updates or deletes. Similarly the table needs to receive 4 Million updates or deletes in order for the auto analyze to start. Mostly tables of such size would become slow before this threshold is received, due to which manually VACUUM FULL ANALYZE is recommend on a regular basis.

This other post suggests that analyze would be necessary if index has an expression to it, which is the case here:

If the index is just on simple columns like in your case, it is not necessary to ANALYZE the table after you create the index.
That is because statistics on the value distribution of the table columns are always collected, no matter if there is an index on the column or not.

However, if you are indexing an expression like upper(some_column) or (CAST(some_column AS date)), you should run ANALYZE after creating the index.
PostgreSQL will then also collect statistics on the value distribution of the indexed expression. This happens automatically whenever autoanalyze runs, but it is a good idea to do it manually right after creating the index so you have good statistics right away.

My concerns here are:

  • is running ANALYZE effective if the index is being created CONCURRENTLY?
  • Could running ANALYZE to a PG running already a CREATE INDEX statement add sufficient load to disrupt service?

@vroldanbet vroldanbet enabled auto-merge October 2, 2023 12:51
@josephschorr josephschorr changed the title Add new indexes to postgres datastore Add new index to postgres datastore Oct 2, 2023
@josephschorr
Copy link
Member

  • is running ANALYZE effective if the index is being created CONCURRENTLY?

Unlikely, but I believe the migration will wait for the index to complete (concurrently) before the next one is run, so the ANALYZE should apply afterwards

Could running ANALYZE to a PG running already a CREATE INDEX statement add sufficient load to disrupt service?

Yes, but one would imagine that CREATE INDEX is not also running unless someone is running the migration command twice (at the same time)

@vroldanbet
Copy link
Contributor Author

@josephschorr

This blog post suggests running analyze should only be done when bulk loading, and that doing it in other situations is adding unnecessary pressure to the database:

Autovacuum also keeps a table’s data distribution statistics up-to-date (it doesn’t rebuild them). When manually run, the ANALYZE command actually rebuilds these statistics instead of updating them. Again, rebuilding statistics when they’re already optimally updated by a regular autovacuum might cause unnecessary pressure on system resources.

The time when you must run ANALYZE manually is immediately after bulk loading data into the target table. A large number (even a few hundred) of new rows in an existing table will significantly skew its column data distribution. The new rows will cause any existing column statistics to be out-of-date. When the query optimizer uses such statistics, query performance can be really slow. In these cases, running the ANALYZE command immediately after a data load to completely rebuild the statistics is a better option than waiting for the autovacuum to kick in.

Another post indicates that doing it after adding an index is actually a good idea:

You may want to run ANALYZE in these situations:

  • when the contents of a table has changed significantly. For example, when a few percents of records in a table have been added, updated or deleted.
  • just before or after adding an index to a table. That may help query planner to generate optimal query plans that will efficiently the new index.

I also tried to find evidences of whether ANALYZE was blocking. From the official docs:

Specifies that ANALYZE should not wait for any conflicting locks to be released when beginning work on a relation: if a relation cannot be locked immediately without waiting, the relation is skipped. Note that even with this option, ANALYZE may still block when opening the relation's indexes or when acquiring sample rows from partitions, table inheritance children, and some types of foreign tables. Also, while ANALYZE ordinarily processes all partitions of specified partitioned tables, this option will cause ANALYZE to skip all partitions if there is a conflicting lock on the partitioned table

This post suggests it does not block

While ANALYZE runs, other queries may access the table because the ANALYZE command does not block the table.

This other blog post also supports this idea:

Note that, in PostgreSQL, ANALYZE command doesn't read or update indexes. It deals only with table/column contents. It also doesn't block the table - other queries may read from the table while ANALYZE is running.

And FWIW, anecdotal evidence on a cluster with millions of rows show the index being picked immediately after run with CREATE INDEX CONCURRENTLY 🤷🏻

the original GC index was rarely selected
when GC'ing relations with millions of relationships.
This caused a sequential scan that could load potentially
Gigabytes worth of data, and overload the database.

⚠️ Postgres 13 and 14 do not correctly select this new index
because the query planner struggles with the xid8 datatype.
Postgres 15 will be needed for this index to
actually work. For scenarios where relations aren't very wide,
older PG versions may have an acceptable query latency for GC,
but we strongly recommend moving to PG15.

This migration also deletes the old suboptimal index after
the new one is created.
Copy link
Member

@josephschorr josephschorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vroldanbet vroldanbet added this pull request to the merge queue Oct 3, 2023
Merged via the queue into main with commit 9035a64 Oct 3, 2023
25 checks passed
@vroldanbet vroldanbet deleted the add-new-indexes branch October 3, 2023 14:59
@github-actions github-actions bot locked and limited conversation to collaborators Oct 3, 2023
@vroldanbet vroldanbet changed the title Add new index to postgres datastore Add new index to postgres datastore for GC Apr 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/datastore Affects the storage system area/perf Affects performance or scalability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants